Recognizing Digits with Just NumPy



In [1]:

    
from PIL import Image
import numpy as np

First, download the MNIST data



In [2]:

    
import os
from urllib.request import urlretrieve

dataset = 'mnist.pkl.gz'

def reporthook(a, b, c):
    # a: blocks transferred so far, b: block size in bytes, c: total file size
    print("\rdownloading: %5.1f%%" % (a*b*100.0/c), end="")

# Download only if the file is not already present
if not os.path.isfile(dataset):
    origin = "https://github.com/mnielsen/neural-networks-and-deep-learning/raw/master/data/mnist.pkl.gz"
    print('Downloading data from %s' % origin)
    urlretrieve(origin, dataset, reporthook=reporthook)



In [3]:

    
import gzip
import pickle
with gzip.open(dataset, 'rb') as f:
    train_set, validation_set, test_set = pickle.load(f, encoding='latin1')

Q

Let's first take a look at what this data is!



In [4]:

    
%run -i q_see_mnist_data.py









    



train_set[0]:      shape=(50000, 784)	 dtype=float32
train_set[1]:      shape=(50000,)	 dtype=int64
validation_set[0]: shape=(10000, 784)	 dtype=float32
validation_set[1]: shape=(10000,)	 dtype=int64
test_set[0]:       shape=(10000, 784)	 dtype=float32
test_set[1]:       shape=(10000,)	 dtype=int64
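The script q_see_mnist_data.py is not shown here. A minimal sketch of how a summary like the one above might be produced, assuming each set is an (X, y) tuple of NumPy arrays; the `describe` helper and the small random stand-in data are hypothetical and only mimic the MNIST layout.

```python
import numpy as np

def describe(name, dataset):
    """Return one summary line per array in an (X, y) tuple."""
    lines = []
    for i, arr in enumerate(dataset):
        lines.append(f"{name}[{i}]: shape={arr.shape}\t dtype={arr.dtype}")
    return lines

# Hypothetical stand-in with the same layout as MNIST, but only 5 samples.
rng = np.random.default_rng(0)
demo_set = (rng.random((5, 784), dtype=np.float32),
            rng.integers(0, 10, size=5, dtype=np.int64))

for line in describe("demo_set", demo_set):
    print(line)
```

Each row of X is one flattened 28*28 image (784 values), and y holds one integer label per row.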

Supervised Learning

Analogy: the Chinese room



In [5]:

    
train_X, train_y = train_set
test_X, test_y = test_set

Take a look at what y is in MNIST



In [6]:

    
# Training data: the first 20 entries of y
train_y[:20]









    Out[6]:





array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4, 3, 5, 3, 6, 1, 7, 2, 8, 6, 9])

Take a look at X in MNIST



In [7]:

    
from IPython.display import display

def showX(X):
    # Scale the float pixel values to 0..255 and convert to 8-bit integers
    int_X = (X*255).clip(0,255).astype('uint8')
    # N*784 -> N*28*28 -> 28*N*28 -> 28 * 28N
    int_X_reshape = int_X.reshape(-1,28,28).swapaxes(0,1).reshape(28,-1)
    display(Image.fromarray(int_X_reshape))

# Training data: the first 20 entries of X
showX(train_X[:20])
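The reshape chain in showX is easier to follow on a tiny example. The sketch below uses three hypothetical 2*2 "images" instead of 28*28 ones, so the side-by-side tiling can be checked by eye.

```python
import numpy as np

# Three hypothetical 2x2 "images", each flattened to a row of length 4 (N*4).
X = np.arange(12).reshape(3, 4)

tiles = X.reshape(-1, 2, 2)         # N*4   -> N*2*2 (one 2x2 image per entry)
rows_first = tiles.swapaxes(0, 1)   # N*2*2 -> 2*N*2 (image row index comes first)
strip = rows_first.reshape(2, -1)   # 2*N*2 -> 2*2N  (images laid out side by side)

print(strip)
# [[ 0  1  4  5  8  9]
#  [ 2  3  6  7 10 11]]
```

Swapping the axes before the final reshape is what keeps each image's rows together; reshaping N*2*2 directly to 2*2N would interleave pixels from different images.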

Start with a simple method

The naive approach: compare images directly and pick the closest one.

Let's try the squared error first

Q

Compute the squared error between u and v

u = train_X[0]
v = train_X[1]



In [8]:

    
%run -i q_square_error.py









    



86.9492
86.9491830207
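q_square_error.py is not shown. A minimal sketch of a few equivalent ways to compute the squared error, using random float32 stand-ins for u and v (an assumption, so the numbers will not match the output above). The float64 variant is included only as one guess at why two values of different precision might be printed.

```python
import numpy as np

# Hypothetical stand-ins for train_X[0] and train_X[1]: float32 vectors of length 784.
rng = np.random.default_rng(0)
u = rng.random(784, dtype=np.float32)
v = rng.random(784, dtype=np.float32)

diff = u - v
err_sum = (diff ** 2).sum()     # elementwise square, then sum
err_dot = np.dot(diff, diff)    # the same quantity written as an inner product
# Same computation carried out in float64 for more precision
err_f64 = ((u.astype(np.float64) - v.astype(np.float64)) ** 2).sum()

print(err_sum, err_dot, err_f64)
```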

Q

Try to:

Display test_X[0]
Find the index of the image in train_X that most resembles test_X[0]
Display that most similar image
Then check the corresponding train_y
Look at test_y[0]



In [9]:

    
%run -i q_find_nn_0.py

[image]
train_X[38620]
[image]
train_X[38620] = 7
test_y[0] = 7
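q_find_nn_0.py is not shown. A minimal sketch of the nearest-image search it describes, assuming squared error as the distance; the `nearest_index` helper and the toy data are hypothetical.

```python
import numpy as np

def nearest_index(train_X, x):
    """Index of the row of train_X with the smallest squared error to x."""
    sq_err = ((train_X - x) ** 2).sum(axis=1)  # one squared error per training row
    return int(sq_err.argmin())

# Hypothetical toy data: 4 training vectors with labels, and 1 query vector.
toy_train_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]])
toy_train_y = np.array([0, 1, 5, 9])
query = np.array([4.5, 5.2])

idx = nearest_index(toy_train_X, query)
print(f"train_X[{idx}]", "label =", toy_train_y[idx])
```

With the real data, the same call would be `nearest_index(train_X, test_X[0])`, followed by a lookup in train_y.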

Q

Do the same for the first 10/100 entries of test_X, then compute the accuracy.



In [10]:

    
%run -i q_find_nn_10.py

test_X[0]
[image]
train_X[38620]
[image]
train_X[38620] = 7
train_X[0] = 7

test_X[1]
[image]
train_X[28882]
[image]
train_X[28882] = 2
train_X[1] = 2

test_X[2]
[image]
train_X[46512]
[image]
train_X[46512] = 1
train_X[2] = 1

test_X[3]
[image]
train_X[29044]
[image]
train_X[29044] = 0
train_X[3] = 0

test_X[4]
[image]
train_X[40094]
[image]
train_X[40094] = 4
train_X[4] = 4

test_X[5]
[image]
train_X[30809]
[image]
train_X[30809] = 1
train_X[5] = 1

test_X[6]
[image]
train_X[18279]
[image]
train_X[18279] = 4
train_X[6] = 4

test_X[7]
[image]
train_X[41982]
[image]
train_X[41982] = 9
train_X[7] = 9

test_X[8]
[image]
train_X[35628]
[image]
train_X[35628] = 5
train_X[8] = 5

test_X[9]
[image]
train_X[5044]
[image]
train_X[5044] = 9
train_X[9] = 9

Accuracy 1.0
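q_find_nn_10.py is not shown. A minimal sketch of how the accuracy over the first n test samples might be tallied, assuming squared error as the distance; the `nn_accuracy` helper and the two-cluster toy data are hypothetical.

```python
import numpy as np

def nn_accuracy(train_X, train_y, test_X, test_y, n):
    """Fraction of the first n test rows whose nearest training row has the same label."""
    hits = 0
    for x, y in zip(test_X[:n], test_y[:n]):
        idx = ((train_X - x) ** 2).sum(axis=1).argmin()  # nearest training row
        hits += int(train_y[idx] == y)
    return hits / n

# Hypothetical toy data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0]])
toy_train_y = rng.integers(0, 2, size=200)
toy_train_X = centers[toy_train_y] + rng.normal(size=(200, 2))
toy_test_y = rng.integers(0, 2, size=50)
toy_test_X = centers[toy_test_y] + rng.normal(size=(50, 2))

for n in (10, 50):
    print(n, nn_accuracy(toy_train_X, toy_train_y, toy_test_X, toy_test_y, n))
```

The Python loop handles one test row at a time, so memory use stays at one (number of training rows)-by-(number of features) temporary array per iteration.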

Q

What if train_X had only 500 samples?

Use reshaping and broadcasting to compute the accuracy on test_X[:100]!

Hint: use np.expand_dims in place of np.reshape



In [11]:

    
# Warning: this may use too much memory!
#%run -i q_small_data.py
# accuracy: 85%
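q_small_data.py is not shown. A minimal sketch of the broadcasting approach the hint points to, using random stand-in arrays with only 64 features (an assumption made to keep the example light). With the real 784-feature data, the intermediate array for 100 test rows against 500 training rows holds 100 * 500 * 784 float32 values, about 157 MB; against all 50,000 training rows it would be about 15.7 GB, which is why the cell above carries a memory warning.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 500 training rows, 100 test rows, 64 features each.
small_train_X = rng.random((500, 64), dtype=np.float32)
small_train_y = rng.integers(0, 10, size=500)
small_test_X = rng.random((100, 64), dtype=np.float32)

# (100, 1, 64) - (1, 500, 64) broadcasts to (100, 500, 64):
# one difference vector for every (test row, training row) pair.
diff = np.expand_dims(small_test_X, 1) - np.expand_dims(small_train_X, 0)
sq_err = (diff ** 2).sum(axis=2)                   # (100, 500) squared errors
predict_y = small_train_y[sq_err.argmin(axis=1)]   # label of the nearest training row

print(diff.shape, sq_err.shape, predict_y.shape)
```

Given true labels, the accuracy would then be `(predict_y == test_y[:100]).mean()`.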

Q

Use other distance functions? e.g. np.abs(...).sum()
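A minimal sketch of swapping in the sum of absolute differences, next to the squared error for comparison. The helper names and the toy vectors are hypothetical; the vectors were chosen so that the two distances pick different neighbors.

```python
import numpy as np

def nearest_l1(train_X, x):
    """Index of the training row with the smallest sum of absolute differences to x."""
    return int(np.abs(train_X - x).sum(axis=1).argmin())

def nearest_l2(train_X, x):
    """Index of the training row with the smallest squared error to x."""
    return int(((train_X - x) ** 2).sum(axis=1).argmin())

# Hypothetical toy data.
toy_train_X = np.array([[3.0, 0.0],   # absolute: 3, squared: 9
                        [2.0, 2.0]])  # absolute: 4, squared: 8
query = np.array([0.0, 0.0])

print(nearest_l1(toy_train_X, query), nearest_l2(toy_train_X, query))  # 0 1
```

The squared error penalizes one large difference more than several small ones, so the choice of distance can change which image counts as "closest".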

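Use the inner product in place of the squared error

$$ \begin{align*} \left\Vert \mathbf{u}-\mathbf{v}\right\Vert ^{2} & =\left(\mathbf{u}-\mathbf{v}\right)\cdot\left(\mathbf{u}-\mathbf{v}\right)\\ & =\left\Vert \mathbf{u}\right\Vert ^{2}-2\mathbf{u}\cdot\mathbf{v}+\left\Vert \mathbf{v}\right\Vert ^{2} \end{align*} $$

When $\mathbf{u}$ and $\mathbf{v}$ are both normalized to unit length, $\left\Vert \mathbf{u}\right\Vert ^{2}=\left\Vert \mathbf{v}\right\Vert ^{2}=1$, so the squared error equals $2-2\mathbf{u}\cdot\mathbf{v}$ and the closest image is the one with the largest inner product.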
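The expansion, and the fact that for unit-length vectors the smallest squared error coincides with the largest inner product, can be checked numerically. This is a sketch on random stand-in vectors, not on the MNIST data.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(784)
v = rng.random(784)

# Both sides of the expansion of ||u - v||^2
lhs = ((u - v) ** 2).sum()
rhs = u @ u - 2 * (u @ v) + v @ v
print(np.isclose(lhs, rhs))

# Hypothetical unit-length "training" rows and one unit-length query
train = rng.random((50, 784))
train /= np.linalg.norm(train, axis=1, keepdims=True)
query = rng.random(784)
query /= np.linalg.norm(query)

by_distance = ((train - query) ** 2).sum(axis=1).argmin()  # smallest squared error
by_inner = (train @ query).argmax()                        # largest inner product
print(by_distance == by_inner)
```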



In [12]:

    
# Normalize each row to unit length
train_X  = train_X / np.linalg.norm(train_X, axis=1, keepdims=True)
test_X  = test_X / np.linalg.norm(test_X, axis=1, keepdims=True)



In [13]:

    
# Matrix multiplication == computing many inner products at once
A = test_X @ train_X.T
print(A.shape)









    



(10000, 50000)



In [14]:

    
A.argmax(axis=1)









    Out[14]:





array([44566, 28882, 15224, ...,  3261,  1311, 22424])



In [15]:

    
predict_y = train_y[A.argmax(axis=1)]



In [16]:

    
# Test data: the first 20 entries of X
showX(test_set[0][:20])



In [17]:

    
# Predicted y: the first 20 entries
predict_y[:20]









    Out[17]:





array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])



In [18]:

    
# Test data: the first 20 entries of y
test_y[:20]









    Out[18]:





array([7, 2, 1, 0, 4, 1, 4, 9, 5, 9, 0, 6, 9, 0, 1, 5, 9, 7, 3, 4])



In [19]:

    
# Accuracy
(predict_y == test_y).mean()









    Out[19]:





0.9708

Reduce the dimensionality with PCA



In [20]:

    
from sklearn.decomposition import PCA
pca = PCA(n_components=60)
train_X = pca.fit_transform(train_set[0])
test_X = pca.transform(test_set[0])



In [21]:

    
train_X.shape









    Out[21]:





(50000, 60)



In [22]:

    
train_X /= np.linalg.norm(train_X, axis=1, keepdims=True)
test_X /= np.linalg.norm(test_X, axis=1, keepdims=True)



In [23]:

    
# Matrix multiplication
A = test_X @ train_X.T



In [24]:

    
predict_y = train_y[A.argmax(axis=1)]



In [25]:

    
# Accuracy
(predict_y == test_y).mean()









    Out[25]:





0.97030000000000005

Q

Try different numbers of dimensions
Check how the error function defined earlier differs before and after PCA.
Use sklearn's kNN
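As a starting point for these exercises, here is a minimal sketch that combines sklearn's PCA with KNeighborsClassifier and loops over several numbers of components. The clustered random data is a hypothetical stand-in for MNIST. Note that KNeighborsClassifier uses Euclidean distance by default, not the normalized inner product used earlier in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 3 classes in 20 dimensions.
centers = rng.normal(scale=5.0, size=(3, 20))
demo_train_y = rng.integers(0, 3, size=300)
demo_train_X = centers[demo_train_y] + rng.normal(size=(300, 20))
demo_test_y = rng.integers(0, 3, size=100)
demo_test_X = centers[demo_test_y] + rng.normal(size=(100, 20))

scores = {}
for n_components in (2, 5, 10):
    pca = PCA(n_components=n_components)
    reduced_train = pca.fit_transform(demo_train_X)  # fit on training data only
    reduced_test = pca.transform(demo_test_X)        # reuse the same projection
    knn = KNeighborsClassifier(n_neighbors=1)        # 1 neighbor = plain nearest neighbor
    knn.fit(reduced_train, demo_train_y)
    scores[n_components] = knn.score(reduced_test, demo_test_y)

print(scores)
```

With the real data, `demo_train_X` and `demo_test_X` would be replaced by `train_set[0]` and `test_set[0]`, and the component counts by values around the 60 used above.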